Packages Installation

# Make sure, that StringsAsFactors = FALSE
options(stringsAsFactors = F)

if(!require(install.load)){
  install.packages("install.load")
  library(install.load)
}

install_load("tidyverse", "moments", "plotly", "data.table", "fitdistrplus", "knitr")
library(tidyverse)
library(moments)
library(plotly)
library(data.table)
library(fitdistrplus)
library(knitr)

General Task

Task 1

Data Import

In this task, we have to create a distribution of logistic delay of component K7. Two data sets were provided, which are Komponente_K7.csv, which contains production date, and Logistikverzug_K7.csv, which contains incoming date of the product. To start analyzing the data, we must import the data sets and set their column types accordingly.

# For semicolon delimited
K7 <- read_delim("Data/Logistikverzug/Komponente_K7.csv",delim = ";",show_col_types = FALSE)[2:6]
K7$Fehlerhaft <- as.logical(K7$Fehlerhaft)
K7$Produktionsdatum <- as.Date(K7$Produktionsdatum,
                                          format = "%Y-%m-%d")  
# For comma delimited
LK7 <- read_delim("Data/Logistikverzug/Logistikverzug_K7.csv",delim = "," ,show_col_types = FALSE)[2:6]
LK7$Fehlerhaft <- as.logical(LK7$Fehlerhaft)
LK7$Wareneingang <- as.Date(LK7$Wareneingang,
                                          format="%Y-%m-%d")

Data Preparation

After importing the data, we check if the column Fehlerhaft in both data sets are equal.

Result<- K7$Fehlerhaft == LK7$Fehlerhaft
Result[FALSE]
## logical(0)

Based on this result, the column Fehlerhaft are in both data sets equal. Now, both data sets K7 and LK7 can be combined by using full_join(). A column with the name Wochentag is created for the day of production date. Afterwards, all NA values in the column Produktionsdatum and Wareneingang are filtered. A column Warenausgang is created, if the day of production date is Friday and Saturday. If the production day is Friday and Saturday, the product can be sent only on Monday, hence the outgoing date equals to production date added by 3 or 2 respectively. For outgoing date of all other days, the production date is added by 1.

logistics_delay <- K7 %>% 
  full_join(LK7, by = c("IDNummer","Herstellernummer", "Werksnummer","Fehlerhaft")) %>%
  mutate(Wochentag = weekdays(Produktionsdatum)) %>%  # Creates a new column Wochentag
  filter(!is.na(Produktionsdatum)) %>% # Filter NA values
  filter(!is.na(Wareneingang)) %>% # Filter NA values
  mutate(Warenausgang = ifelse(Wochentag == "Freitag",
                               Produktionsdatum +3,
                               ifelse(Wochentag == "Samstag",
                                      Produktionsdatum +2,
                                      Produktionsdatum +1))) 

The column Warenausgang can now be formatted in date-format. Logistic delay can then be calculated by substracting the incoming date (Wareneingang) with the outgoing date (Warenausgang).

logistics_delay <- logistics_delay %>%
  mutate(Warenausgang = as.Date(Warenausgang,
                                origin = "1970-01-01",
                                format="%Y-%m-%d")) %>% 
  mutate(Verzug = Wareneingang - Warenausgang)

Task 1.a - Logistics Distribution

To determine the distribution of the logistic delay, we use the function descdist from the package fitdistrplus. To use this, the column Verzug needs to be converted in numeric. Afterwards, the column is analyzed for discrete distribution, since the data contains integer values.

# Convert Verzug column in numeric
logistics_delay$Verzug <- as.numeric(logistics_delay$Verzug) 

# Determine the logistics distribution
descdist(logistics_delay$Verzug, discrete = TRUE)

## summary statistics
## ------
## min:  3   max:  14 
## median:  6 
## mean:  6.080441 
## estimated sd:  1.012306 
## estimated skewness:  0.567401 
## estimated kurtosis:  3.630036

According to the Cullen and Frey graph, the logistics data follows negative binomial distribution. The skewness value is to analyze the symmetrie or the lack of symmetrie of the distribution. Based on the fact that the skewness value is positive and the mean value is slightly larger than median, the distribution is positively skewed, i.e. the tail is on the left side of the distribution (Source: https://www.vrcbuzz.com/karl-pearsons-coefficient-of-skewness-using-r-with-examples/). The kurtosis value is greather than 3, it is said to be leptokurtic.This means that there are more chances to be outliers and that the distribution is peakead and haslong tails (thick).

Task 1. b - Logistics Delay Mean

To calculate the mean, following code is implemented.

mean <- as.numeric(mean(logistics_delay$Verzug)) 

# Ausgeben der Ergebnisse
cat("Mean = ", mean)
## Mean =  6.080441

Task 1.c - Visualization

To determine the bin width for our histogram, we calculate the difference between Max, in our data equals to 14, and Min value, in our data equals to 1, of the delays and divide it by the total number of bins, in this data equals to 14. So the bin width is:

binwidth <-(14-1)/14
cat("Binwidth = ", binwidth)
## Binwidth =  0.9285714

Since the calculated binwidth is 0,9, it would make much more sense to round it to 1, because the delay days are integer numbers. The data can be plotted as follows:

logistics_delay_plot <- ggplot(logistics_delay, aes(x = Verzug ))+
  geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=1)+
  scale_x_continuous(breaks = c(0:14))+
  geom_density( fill="#FF6666",adjust = 10,alpha = 0.5) 


ggplotly(logistics_delay_plot)

Task 1.d - Decision Tree

To create a decision tree, its important to know what and how many attributes there are, that influence the defect of component K7. Based on the data set K7, there are four columns that could be relevant. These are IDNummer, Produktionsdatum, Herstellernummer and Werksnummer. It could be noticed, that the IDNummer, Hersteller, and Werksnummer contain similar information. Herstellernummer and Werksnummer are directly connected, hence both should not be considered as different attribute. Inside the IDNummer, Herstellernummer and Werksnummer are also included and followed by the serie number of production. Hence, the IDNummer should also not be included as a different attribute. Based on this, the defect of component K7 is attributed solely from Produktionsdatum and Herstellernummer. The data set could be summarized as follow:

K7_decision_tree <- K7 %>% 
  dplyr::select(Produktionsdatum, Herstellernummer, Fehlerhaft) %>% 
  arrange(desc(Fehlerhaft))

head(K7_decision_tree)
## # A tibble: 6 × 3
##   Produktionsdatum Herstellernummer Fehlerhaft
##   <date>                      <dbl> <lgl>     
## 1 2011-02-02                    114 TRUE      
## 2 2011-06-02                    114 TRUE      
## 3 2013-10-29                    112 TRUE      
## 4 2014-01-14                    112 TRUE      
## 5 2014-06-02                    112 TRUE      
## 6 2014-11-20                    112 TRUE
tail(K7_decision_tree)
## # A tibble: 6 × 3
##   Produktionsdatum Herstellernummer Fehlerhaft
##   <date>                      <dbl> <lgl>     
## 1 2016-11-12                    112 FALSE     
## 2 2016-11-12                    112 FALSE     
## 3 2016-11-12                    112 FALSE     
## 4 2016-11-12                    112 FALSE     
## 5 2016-11-12                    112 FALSE     
## 6 2016-11-13                    112 FALSE

The next step would be to determine, which of the attributes can be considered as root characteristics. To do this, we have to measure Information Gain value \(IG(D,A)\) by applying each attribute A of both Produktionsdatum and Herstellernummer to the destination attribute D, which is in this case Fehlerhaft . This can be calculated by substracting \(H(D)\), which is the entropy of the datset of destination attribute D, with \(Rest(A)\), which is the remaining entropy that is still there after testing attribute A.

To calculate entropy of the dataset D, following equation is used, whereby the variable \(q\) describes the proportion of defective components to the total number of components:

Afterwards, we can calculate the remaining entropy of each attributes (Produktionsdatum and Herstellernummer) by using following equation:

For this, \(p_{k}\) and \(n_{k}\) corresponds with the number of defective and non-defective components in correlation with each component of corresponding attributes. \(q_{k}\) can be calculated by dividing \(p_{k}\) with the sum of \(p_{k}\) and \(n_{k}\). By using this equation, we can calculate the remaining entropy for both Produktionsdatum - \(Rest(Produktionsdatum)\) - and Herstellersnummer - \(Rest(Herstellersnummer)\) and hence determine the Information Gain for both \(IG(D, Produktionsdatum\) and \(IG(D, Produktionsdatum)\). The attribute with the largest information gain would then be selected as the root attribute and the other as the sub-attribute. The decision tree may look as follow.

Task 2 Data Base Structure

It is reasonable to save data in separate files instead of one huge table, because:

  1. More stable in term of losing information, if user somehow accidentally deletes 1 file, it won’t affect all other tables.
  2. The user can do pick and choose, that means the user can decide which data they want to read and there is not need to read and load the whole table.
  3. NULL values would not then be accumulated. In our case study, if we create only 1 data set containing all vehicles with their parts, some vehicles would have T16 part, but the majority doesn’t, which would result in to a data set with almost full column of NULL values.
  4. It is quicker to find information in separate files, because the size of separate file is smaller than the size of the whole table.
  5. More Flexibility in terms of memory because each file can be store on different servers.
  6. It is more secure. In one hand, it would be possible to manage the access to the data therefore the users would get permissions to what they need and not to the whole data set. On the other hand, if an account gets hacked, the hacker would only see part of the information but the the whole data.

Provided structure represents a relational database structure. It contains reference keys, binding tables and built on table basis.

Task 3 Components

To get the Cars that were registered in ADELSHOFEN, we read the file “Zulassung” and we filter by Gemeinden ADELSHOFEN

# Reading Zulassung File
zulassung <- read_delim("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv",delim = ";",show_col_types = FALSE)
zulassung_adel=zulassung%>%filter(Gemeinden =="ADELSHOFEN")

To determine the number of cars that were build with T16 parts and got registered in ADELSHOFEN, it is necessary to know the id of the cars that have a T16 part.To start with we read the file “Bestandteile Komponente” and determine the components that have T16 parts.

# Reading T16 Teil Komponente

T16_Kom_1 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2LE2.csv",
                        delim =";",
                        show_col_types = FALSE) %>%
  dplyr::select("ID_T16","ID_K2LE2")

names(T16_Kom_1)[2] <- "ID_Sitze"

T16_Kom_2 <- read_delim("Data/Komponente/Bestandteile_Komponente_K2ST2.csv",
                        delim =";",
                        show_col_types = FALSE) %>%
  dplyr::select("ID_T16","ID_K2ST2")

names(T16_Kom_2)[2]<-"ID_Sitze"

Once we have the components, we need to find out the car ids. For that we read the file “Bestandteile Fahrzeuge” and we combine the data (Bestandteil Komponente with Komponente Farhzeug).

Kom_K2LE2<- read.delim("Data/Komponente/Komponente_K2LE2.txt",sep ="\\")%>%dplyr::select(c("ID_Sitze","Werksnummer")) # not need to read it doesnt give any extra info
Kom_K2ST2<-read_delim("Data/Komponente/Komponente_K2ST2.csv",delim = ";",show_col_types = FALSE)%>%dplyr::select(c("ID_Sitze","Werksnummer"))
T16_K2LE2<-T16_Kom_1%>%left_join(Kom_K2LE2,by="ID_Sitze")
T16_K2ST2<-T16_Kom_2%>%left_join(Kom_K2ST2,by="ID_Sitze")
T16_Kom_bind<-rbind(T16_K2LE2,T16_K2ST2)
#read Bestandteil Fahrzeuge and combine them
Kom_Fahr_1<-read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv",delim = ";",show_col_types = FALSE)%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
Kom_Fahr_2<-read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv",delim =";",show_col_types = FALSE)%>%dplyr::select(c("ID_Sitze","ID_Fahrzeug"))
Kom_Fahr_bind<-rbind(Kom_Fahr_1,Kom_Fahr_2)

Finally we merge our data “Bestandteile Fahrzeuge” with the”Zulassung” data filter by ADELSHOFEN

#join the different datasets to create a dataset that only contains the IDs of the components that where produced in ADELSHOFEN
Kom_Fahr <- T16_Kom_bind %>%left_join(Kom_Fahr_bind, by ="ID_Sitze")
Kom_Fahr_Adel <-Kom_Fahr%>%inner_join(zulassung_adel,by =c("ID_Fahrzeug"="IDNummer"))

Number of Cars register in ADELSHOFEN:

cat("Cars registered in Adelshofen is ",nrow(Kom_Fahr_Adel))
## Cars registered in Adelshofen is  48

Task 4 Data Type Attributes

The data type of the table “Zulassung” is shown with the str() function

str(zulassung)
## spec_tbl_df [3,204,104 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1     : num [1:3204104] 408097 408098 1 2 3 ...
##  $ IDNummer : chr [1:3204104] "11-1-11-1" "11-1-11-2" "12-1-12-1" "12-1-12-2" ...
##  $ Gemeinden: chr [1:3204104] "DRESDEN" "DRESDEN" "LEIPZIG" "LEIPZIG" ...
##  $ Zulassung: Date[1:3204104], format: "2009-01-01" "2009-01-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   IDNummer = col_character(),
##   ..   Gemeinden = col_character(),
##   ..   Zulassung = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

The data types are explained in the following table:

Attribute Data.Type Description
…1 Numeric This data type contains numbers that represents unique index of each registeration
IDNummer Character This data type contains strings with vehicle type, its OEM number, its factory, and its unique index
Gemeinden Character This data type contains strings of registration location
Zulassung Date This data type contains date of registration

Task 5 Server

Q: Why does it make sense to store the records on the database of a server?

A: To save the records on the database of a server has the advantage that no local memory is occupied but more importantly, a permanent access to the most recently data records can be guaranteed in most cases , with a few exception as for example when maintenance takes place.

Q: Why can’t you store the records on your personal computer?

A: Since databases tend to store a large amount of data, this could cause the computer to slow down. Also the quality of the data deteriorate very fast, since the data doesn’t get updated.

Q: What is an easy way to make your application available to your customers?

A: “RStudio Shiny Server Open Source” allows customers to get access to the app from anywhere with any web browser. To share the app, the code can be uploaded to git, for example, as a public repository.

Task 6 Car Registration

In order to find out where car with the chassis “K5-112-1122-79” got registered, it is important to know the car ID. This is possible by reading the file “Bestandteile Fahrzeuge”.

car <- read_delim("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv",delim =";",
                  show_col_types = FALSE) %>%
  dplyr::select(c("ID_Karosserie","ID_Fahrzeug")) %>%
  filter(ID_Karosserie == "K5-112-1122-79")

car_registerd <- filter(zulassung, IDNummer == car$ID_Fahrzeug)
gemeinden <- as.character(car_registerd$Gemeinden)
cat("Car with ID Karosserie'K5-112-1122-79' got registered in ",
    gemeinden) 
## Car with ID Karosserie'K5-112-1122-79' got registered in  ASCHERSLEBEN